Discovering data quality rules

Authors

  • Fei Chiang
  • Renée J. Miller
Abstract

Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasted time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code. In this work, we propose a new data-driven tool that can be used within an organization’s data quality management process to suggest possible rules and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule stating that, for CS graduate courses, the course number and term functionally determine the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records). We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and to return only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.
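To make the notion of a CFD that "almost holds" concrete, the following is a minimal Python sketch of checking one candidate rule against a small table. It is not the discovery algorithm from the paper: the function name `check_cfd`, the sample course records, and the heuristic of treating the most frequent right-hand-side value in each group as the conformant one are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def check_cfd(records, context, lhs, rhs):
    """Check the rule "within `context`, lhs -> rhs" (hypothetical helper).

    Returns (confidence, violations), where confidence is the fraction of
    in-context records that agree with the majority rhs value of their group.
    """
    # Keep only the records matching the context (the condition of the CFD).
    in_context = [r for r in records
                  if all(r[a] == v for a, v in context.items())]

    # Group in-context records by their lhs values.
    groups = defaultdict(list)
    for r in in_context:
        groups[tuple(r[a] for a in lhs)].append(r)

    # Assumption: within a group, the most frequent rhs value is the clean
    # one, and records that disagree with it are flagged as non-conformant.
    violations = []
    for rows in groups.values():
        counts = Counter(tuple(r[a] for a in rhs) for r in rows)
        majority, _ = counts.most_common(1)[0]
        violations.extend(r for r in rows
                          if tuple(r[a] for a in rhs) != majority)

    if not in_context:
        return 1.0, []
    return 1 - len(violations) / len(in_context), violations


# Made-up sample data, loosely following the course example in the abstract.
courses = [
    {"dept": "CS", "level": "grad", "number": "2501", "term": "F07",
     "room": "BA1170", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "number": "2501", "term": "F07",
     "room": "BA1170", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "number": "2501", "term": "F07",
     "room": "BA1180", "instructor": "Lee"},   # disagrees on room
    {"dept": "CS", "level": "grad", "number": "2502", "term": "F07",
     "room": "BA2135", "instructor": "Kim"},
    {"dept": "CS", "level": "undergrad", "number": "2501", "term": "F07",
     "room": "SS100", "instructor": "Park"},   # outside the context
]

confidence, violations = check_cfd(
    courses,
    context={"dept": "CS", "level": "grad"},
    lhs=["number", "term"],
    rhs=["room", "instructor"],
)
print(confidence)   # 0.75
print(violations)   # the single record with room BA1180
```

Here the rule holds for three of the four CS graduate records, so it would count as a dependency that almost holds, and the one disagreeing record is returned as potentially dirty. The undergraduate record is ignored because it falls outside the context.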


Similar resources

A Distributed-Population Genetic Algorithm for Discovering Interesting Prediction Rules

In data mining, the quality of prediction rules basically involves three criteria: accuracy, comprehensibility, and interestingness. The majority of the rule induction literature focuses on discovering accurate, comprehensible rules. In this paper we also take these two criteria into account, but we go beyond them in the sense that we aim at discovering rules that are interesting (surprising) for th...


Discovering Data Quality Rules in a Master Data Management

Dirty data continues to be an important issue for companies. The Data Warehousing Institute [Eckerson, 2002], [Rockwell, 2012] stated that poor data costs US businesses $611 billion annually and that erroneously priced data in retail databases costs US customers $2.5 billion each year. Data quality is becoming more and more critical. The database community pays particular attention to this subject whe...


Discovering Non-Redundant Association Rules using MinMax Approximation Rules

Dept. of Comp. Sci. & Eng., Vaagdevi College of Eng., Warangal, India. Frequent pattern mining is an important area of data mining used to generate association rules. The quality of the extracted frequent patterns is a big concern, as mining generates huge sets of rules, many of which are redundant. Mining Non-Redundant Frequent patterns is a big concern in the area of Ass...


Automating Objective Data Quality Assessment (experiences

The paper discusses the design goals, current architecture and overall experiences gained in the process of building a software tool aiding human analysts in estimating approximate information quality (information accuracy) of an unknown relational data source. We discuss the algorithms and techniques that we found effective. In particular, we discuss the automated reasoning techniques used to ...


Cost of Low-Quality Data over Association Rules Discovery

Quality in data mining critically depends on the preparation and on the quality of processed data sets. Indeed, data mining processes and applications require various forms of data preparation (and repair) with several data formatting and cleaning techniques, because the data input to the mining algorithms is assumed to conform to nice data distributions, containing no missing, inconsistent or i...



Journal:
  • PVLDB

Volume 1

Publication date: 2008